System for estimating parameters of a Gaussian mixture model

ABSTRACT

A signal processing system is disclosed which is implemented using a Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM), or a GMM alone, the parameters of which are constrained during the optimisation procedure. Also disclosed is a constraint system applied to input vectors representing the input signal to the system. The invention is particularly, but not exclusively, related to speech recognition systems. The invention reduces the tendency, common in prior art systems, to get caught in local minima associated with highly anisotropic Gaussian components, which degrade recogniser performance, by employing the constraint system as above, whereby the anisotropy of such components is minimised. The invention also covers a method of processing a signal, and a speech recogniser trained according to the method.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to a system and method for processing signals to aid their classification and recognition. More specifically, the invention relates to a modified process for training and using both Gaussian Mixture Models and Hidden Markov Models to improve classification performance, particularly but not exclusively with regard to speech.

2. Description of the Art

Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs) are often used in signal classifiers to help identify an input signal when given a set of example inputs, known as training data. Uses of the technique include speech recognition, where the audio speech signal is digitised and input to the classifier, and the classifier attempts to generate from its vocabulary of words the set of words most likely to correspond to the input audio signal. Further applications include radar, where radar signal returns from a scene are processed to provide an estimate of the contents of the scene, and in image processing. Published International specification WO02/08783 demonstrates the use of Hidden Markov Model processing of radar signals.

Before a GMM or HMM can be used to classify a signal, it must be trained with an appropriate set of training data to initialise parameters within the model to provide most efficient performance. There are thus two distinct stages associated with practical use of these models, the training stage and the classification stage. With both of these stages, data is presented to the classifier in a similar manner. When applied to speech recognition, a set of vectors representing the speech signal is typically generated in the following manner. The incoming audio signal is digitised and divided into 10 ms segments. The frequency spectrum of each segment is then taken, with windowing functions being employed if necessary to compensate for truncation effects, to produce a spectral vector. Each element of the spectral vector typically measures the logarithm of the integrated power within each different frequency band. The audible frequency range is typically spanned by around 25 such contiguous bands, but one element of the spectral vector is conventionally reserved to measure the logarithm of the integrated power across all frequency bands, i.e. the logarithm of the overall loudness of the sound. Thus, each spectral vector conventionally has around 25+1=26 elements; in other words, the vector space is conventionally 26-dimensional. These spectral vectors are time-ordered and constitute the input to the HMM or GMM, as a spectrogram representation of the audio signal.
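By way of illustration only, the conventional encoding described above might be computed along the following lines; the band splitting, windowing and library choices here are assumptions made for the sketch rather than details taken from any particular prior art system.

```python
import numpy as np

def log_power_spectral_vector(frame, n_bands=25):
    """Conventional encoding: log integrated power per band, plus overall log power.

    `frame` is one 10 ms slice of the digitised audio signal.
    Band edges and windowing are illustrative choices.
    """
    windowed = frame * np.hanning(len(frame))        # reduce truncation effects
    power = np.abs(np.fft.rfft(windowed)) ** 2       # power spectrum of the slice
    bands = np.array_split(power, n_bands)           # n_bands contiguous frequency bands
    band_power = np.array([b.sum() for b in bands])
    eps = 1e-12                                       # avoid log(0)
    elements = np.log(band_power + eps)               # 25 log-power elements
    overall = np.log(band_power.sum() + eps)          # 1 element for overall loudness
    return np.concatenate([elements, [overall]])      # 26-dimensional spectral vector
```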

Training both the GMM and the HMM involves establishing an optimised set of parameters associated with the processes using training data, such that optimal classification occurs when the model is subjected to unseen data.

A GMM is a model of the probability density function (PDF) of its input vectors (e.g. spectral vectors) in their vector space, parameterised as a weighted sum of Gaussian components, or classes. Available parameters for optimisation are the means and covariance matrices for each class, and the prior class probabilities. The prior class probabilities are the weights of the weighted sum of the classes. These adaptive parameters are typically optimised for a set of training data by an adaptive, iterative, re-estimation procedure such as the Expectation Maximisation (EM) and log-likelihood gradient ascent algorithms, which are well known procedures for finding a set of values for all the adaptive parameters that maximises the training-set average of the logarithm of the model's likelihood function (log-likelihood). These iterative procedures refine the values of the adaptive parameters from one iteration to the next, starting from initial estimates, which may just be random numbers lying in sensible ranges.
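For readers unfamiliar with the EM re-estimation referred to here, the following is a minimal, unconstrained sketch of a single EM iteration for a GMM; it is illustrative only, and the numpy/scipy conventions are an assumption rather than a reflection of any particular prior art implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_iteration(X, means, covs, priors):
    """One EM re-estimation step for a GMM (illustrative sketch).

    X: (T, n) training vectors; means: (N, n); covs: list of N (n, n) matrices; priors: (N,).
    Returns updated means, covariance matrices and prior class probabilities.
    """
    T, N = X.shape[0], means.shape[0]
    resp = np.zeros((T, N))
    for j in range(N):                                   # E-step: posterior class probabilities
        resp[:, j] = priors[j] * multivariate_normal.pdf(X, means[j], covs[j])
    resp /= resp.sum(axis=1, keepdims=True)
    Nj = resp.sum(axis=0)                                # effective number of vectors per class
    new_means = (resp.T @ X) / Nj[:, None]               # M-step: class means
    new_covs = []
    for j in range(N):                                   # M-step: class covariance matrices
        d = X - new_means[j]
        new_covs.append((resp[:, j, None] * d).T @ d / Nj[j])
    new_priors = Nj / T                                  # M-step: prior class probabilities
    return new_means, new_covs, new_priors
```

Each iteration of this kind does not decrease the training-set average log-likelihood, which is why the step is repeated until the adaptive parameters converge.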

Once the adaptive parameters of a GMM have been optimised, those trained parameters may subsequently be used for identifying the most likely of the set of alternative models for any observed spectral vector, i.e. for classification of the spectral vector. The classification step involves the conventional procedure for computing the likelihood that each component of the GMM could have given rise to the observed spectral vector.

Whereas a GMM is a model of the PDF of individual input vectors irrespective of their mutual temporal correlations, a HMM is a model of the PDF of time-ordered sequences of input vectors. The adaptive parameters of an ordinary HMM are the observation probabilities (the PDF of input vectors given each possible hidden state of the Markov chain) and the transition probabilities (the set of probabilities that the Markov chain will make a transition between each pair-wise combination of possible hidden states).

A HMM may model its observation probabilities as Gaussian PDFs (otherwise known as components, or classes) or weighted sums of Gaussian PDFs, i.e. as a GMM. Such HMMs are known as GMM based HMMs. The observation probabilities of a GMM-based HMM are parameterised as a GMM, but the GMM-based HMM is not itself a GMM. An input stage can be added to a GMM based HMM however, where this input stage comprises a simple GMM. The log-likelihood of a GMM-based HMM is the log-likelihood of an HMM whose observation probabilities are constrained to be parameterised as GMMs; it is not the log-likelihood of a GMM. Consequently, the optimisation procedure of a GMM-based HMM is not the same as that of a GMM. However, a prescription for optimising a GMM based HMM's observation probabilities can be re-cast as a prescription for optimising the associated GMM's class means, covariance matrices and prior class probabilities.

Training, or optimisation, of the adaptive parameters of a HMM is done so as to maximise the overall likelihood function of the model of the input signal, such as a speech sequence. One common way of doing this is to use the Baum-Welch re-estimation algorithm, which is a development of the technique of expectation maximisation of the model's log-likelihood function, extended to allow for the probabilistic dependence of the hidden states on their earlier values in the speech sequence. A HMM is first initialised with initial, possibly random, assumptions for the values of the transition and observation probabilities.

For each one of a set of sequences of input training vectors, such as speech-sequences, the Baum-Welch forward-backward algorithm is applied, to deduce the probability that the HMM could have given rise to the observed sequence. On the basis of all these per-sequence model likelihoods, the Baum-Welch re-estimation formula updates the model's assumed values for the transition probabilities and the observation probabilities (i.e. the GMM class means, covariance matrices and prior class probabilities), so as to maximise the increase in the model's average log-likelihood. This process is iterated, using the Baum-Welch forward-backward algorithm to deduce revised model likelihoods for each training speech-sequence and, on the basis of these, using the Baum-Welch re-estimation formula to provide further updates to the adaptive parameters.

Each iteration of the conventional Baum-Welch re-estimation procedure can be broken down into five steps for every GMM-based HMM: (a) applying the Baum-Welch forward-backward algorithm on every training speech-sequence, (b) the determination of what the updated values of the GMM class means should be for the next iteration, (c) the determination of what the updated values of the GMM class covariance matrices should be for the next iteration, (d) the determination of what the updated values of the GMM prior class probabilities should be for the next iteration, and (e) the determination of what the updated values of the HMM transition probabilities should be for the next iteration. Thus, the Baum-Welch re-estimation procedure for optimising a GMM-based HMM can be thought of as a generalisation of the EM algorithm for optimising a GMM, but with the updated transition probabilities as an extra, fourth output.

For certain applications, HMMs are employed that do not have their observation probabilities parameterised as GMMs, but instead use lower level HMMs. Thus, a hierarchy is formed that comprises at the top a “high level” HMM, and at the bottom a GMM, with each layer having its observation probabilities defined by the next stage down. This technique is common in subword-unit based speech recognition systems, where the structure comprises two nested levels of HMM, with the lowest one having GMM based observation probabilities.

The procedure for optimising the observation probabilities of a high-level HMM reduces to the conventional procedure for optimising both the transition probabilities and the observation probabilities (i.e. the GMM parameters) of the ordinary HMMs at the lower level, which is as described above. The procedure for optimising the high-level HMM's transition probabilities is the same as the conventional procedure for optimising ordinary HMMs' transition probabilities, which is as described above.

HMMs can be stacked into multiple-level hierarchies in this way. The procedure for optimising the observation probabilities at any level reduces to the conventional procedure for optimising the transition probabilities at all lower levels combined with the conventional procedure for optimising the GMM parameters at the lowest level. The procedure for optimising the transition probabilities at any level is the same as the conventional procedure for optimising ordinary HMMs' transition probabilities. Thus, the procedure for optimising hierarchical HMMs can be described in terms of recursive application of the conventional procedures for optimising the transition and observation probabilities of ordinary HMMs.

Once the HMM's adaptive parameters have been optimised, the trained HMM may subsequently be used for identifying the most likely of a set of alternative models of an observed sequence of input vectors: spectral vectors in the case of speech classification, and complex amplitude or image data in the case of radar and other images. This process is conventionally achieved using the Baum-Welch forward-backward algorithm, which computes the likelihood of generating the observed sequence of input vectors from each of a set of alternative HMMs with different optimised transition and observation probabilities.

The classification methods described above have certain disadvantages. When optimising the observation probabilities of the GMMs, and hence of the HMMs that may be hierarchically above them, as well as the transition probabilities of the HMM, there is a tendency for the optimisation to get caught in local minima, which prevents the system from achieving optimal classification. This can often be attributed to a tendency for class likelihood-PDFs to become “tangled up” with one another if they are free to become too highly anisotropic. Also, regarding speech recogniser technology, current recognisers are poor at capturing subtle variations and intrinsic characteristics of real speech, such as the full, specific variability of speakers' vowels under very different speaking conditions. In particular, individual vowels occupy complex shapes in spectral vector space, and attempting to represent these shapes as Gaussian distributions, as is conventionally done, can lead to unfaithful representation of the speech sounds.

SUMMARY OF THE INVENTION

According to the present invention there is provided a signal processing system for processing a plurality of multi-element data encoding vectors, the system:

-   having means for deriving the data encoding vectors from input signals;
-   being arranged to process the data encoding vectors using a Gaussian Mixture Model (GMM), the GMM having at least one class mean vector having multiple elements;
-   being arranged to process the elements of the class mean vector(s) by an iterative optimisation procedure;

characterised in that the system is also arranged to scale the elements of the class mean vector(s) during the optimisation procedure to provide for the class mean vector(s) to have constant modulus at each iteration, and to normalise the data encoding vectors input to the GMM.

Preferably the moduli of the mean vectors of each of the GMMs are rescaled after each iteration of the optimisation procedure so that they are all of equal value.

Most signal processing systems of the type discussed in this specification incorporate a GMM that represents the probability density function of all data encoding vectors in the training sequence. The constraint of limiting the class mean vectors to have constant modulus leads to simplified processing of the GMMs making up the signal processing system, as the class means of each GMM will lie on the surface of a hypersphere having dimensionality (n−1), where n is the dimension of an individual vector.

Preferably a covariance matrix associated with the GMM is constrained so as to be isotropic and diagonal, and to have a variance constrained to be a constant value. This eliminates the possibility of certain classes of severe local minima associated with highly anisotropic Gaussian components, and so prevents such sub-optimal configurations from forming during the training process. Note that a covariance matrix that is so constrained may be regarded mathematically as a scalar value, and hence a scalar value may be used to represent such a covariance matrix.

Eliminating certain classes of local minima, by employing the novel constraints of the present invention, may have very significant and novel extra advantages (over and above the need to limit or avoid local minima if possible) under certain circumstances. These circumstances occur whenever the probability distribution function (PDF) of the data-encoding vectors is invariant with respect to orthogonal symmetries such as permutation transformations. Eliminating certain classes of local minima by employing the novel constraints of the present invention may, under these circumstances, enable the class means of the GMM themselves to become symmetric under these same symmetry transformations after adaptation procedures such as the well-known expectation maximisation (EM) algorithm. This provides a means for such adaptation procedures to derive GMMs whose posterior class probabilities are invariant with respect to these same symmetry transformations; this attribute will be useful for producing transformation-robust pattern recognition systems.

Each GMM, and therefore GMM based HMM, has a set of prior class probabilities. Preferably the prior class probabilities associated with the GMM are constrained to be equal, and to remain constant throughout the optimisation procedure.

Prior art signal processing systems incorporating GMMs generally avoid putting constraints on the model parameters; other than that covariance matrices are on occasion constrained to be equal across classes, requirements are rarely imposed on the class means, covariance matrices, prior class probabilities and hidden-state transition probabilities other than that their values are chosen to make the average log-likelihood as large as possible.

Preferably, each data encoding vector that is also an input vector, derived from the input signal during both training and classifying stages of using the GMM, is constrained such that its elements x_(i) are proportional to the square roots of the integrated power within different frequency bands. Advantageously, the elements of each such data encoding vector are scaled such that the squares of the elements of the vector sum to a constant value that is independent of the total power of the original input.

Preferably each such data encoding vector is augmented with the addition of one or more elements representing the overall power in the vector. The scaling of the vector elements described above removes any indication of the power, so the additional element(s) provide the only indication of the power, or loudness, within the vector. Clearly, the computation of the value of the elements representing power would need to be based on pre-scaled elements of the vector.

Note that in this specification the terms “input vector” and “spectral vector” are used interchangeably in the context of providing an input to the lowest level of the system hierarchy. The vector at this level may represent the actual power spectrum of the input signal, and hence be spectral coefficients, or may represent some modified form of the power spectrum. In practice, the input vector will generally represent a power spectrum of a segment of a temporal input signal, but this will not be the case for all applications. Further processing of the temporal input signal is used in some applications, e.g. cosine transform. A “data encoding vector” is, within this specification, any vector that is used as an input to any level of the hierarchy, depending on the context, i.e. any vector that is used as the direct input to the particular level of the hierarchy being discussed in that context. A data encoding vector is thus an input vector only when it represents the information entering the system at the lowest level of the hierarchy.

Note also that normalising a vector is the process of rescaling all its elements by the same factor, in order to achieve some criterion defined on the whole vector of elements. What that factor is depends on the criterion chosen for normalisation. A vector can generally be normalised by one of two useful criteria; one is to normalise such that the elements sum to a constant after normalisation, the other is to normalise such that the squares of the elements sum to a constant after normalisation. By the first criterion, the rescaling factor should be proportional to the reciprocal of the sum of the values of the elements before normalisation. By the second criterion, the rescaling factor should be proportional to the reciprocal of the square root of the sum of the squares of the values of the elements before normalisation. A vector of exclusive probabilities is an example of a vector normalised by the first criterion, such that the sum of those probabilities is 1. A (real-valued) unit vector is an example of a vector normalised according to the second criterion; the sum of the squares of the elements of a (real-valued) unit vector is 1. A vector whose elements comprise the square roots of a set of exclusive probabilities is also an example of a vector normalised by the second criterion.
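As a small illustration of the two criteria (a restatement in code for convenience, not part of the specification itself):

```python
import numpy as np

def normalise_sum(v, constant=1.0):
    # First criterion: rescale so that the elements sum to `constant`.
    return constant * v / v.sum()

def normalise_squares(v, constant=1.0):
    # Second criterion: rescale so that the squares of the elements sum to `constant`.
    return v * np.sqrt(constant / (v ** 2).sum())
```

With `constant=1.0`, the second function returns a unit vector, and applying it to a vector of square roots of exclusive probabilities leaves that vector unchanged.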

Note that for the purposes of this specification, any reference to GMMs should be taken to include Exponential Mixture Models (EMMs). EMMs may be regarded as a special case of GMMs because one can derive equations and procedures for optimising simple EMMs and EMM based HMMs by setting constant the moduli |x| and |w| of the GMM's data-encoding vectors and class means respectively, and constructing the GMM's covariance matrix to be isotropic, in the conventional EM algorithm for simple GMMs or the conventional Baum-Welch re-estimation procedure for GMM based HMMs. Nevertheless, the equations and procedures so derived are valid for EMMs even when |x| and |w| are not constant, and they constitute valid prescriptions for optimising general EMMs.

According to a further aspect of the present invention there is provided a signal processing system for processing a plurality of multi-element data encoding vectors, the system:

-   having means for deriving the data encoding vectors from input signals;
-   being arranged to process the data encoding vectors using a Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM), the GMM based HMM having at least one class mean vector having multiple elements;
-   being arranged to process the elements of the class mean vector(s) by an iterative optimisation procedure;

characterised in that the system is also arranged to scale the elements of the class mean vector(s) during the optimisation procedure to provide for the class mean vector(s) to have constant modulus at each iteration, and to normalise the data encoding vectors input to the GMM based HMM.

The invention as described herein may equally well be applied to a system that employs only GMMs, or that employs GMM based HMMs, or indeed that employs GMM based HMMs whose data-encoding vectors are derived from the posterior class probabilities of separate, low level, GMMs.

Note that the constraints and conditions that may be imposed on GMM parameters as discussed above, including the mean vectors and covariance matrix, and prior class probabilities, may also be imposed on equivalent parameters of the GMM based HMM. Likewise, the processing applied to data encoding vectors as described above for use with a GMM based system may equally well be applied to a GMM based HMM system.

Certain applications, notably subword-unit based models, advantageously employ a HMM that uses as its observation probability a GMM constrained according to the current invention, wherein the HMM acts as the observation probability for a further HMM. In this way, a hierarchy of HMMs can be built up, in the manner of the prior art, but with the difference that the constraints on the model parameters according to the current invention are applied at each level of the hierarchy.

Advantageously, the hierarchy may incorporate two GMMs as two lower levels, with a HMM at the highest level. The lowest level GMM provides posterior probabilities as a data encoding vector to a second, higher level GMM. This second GMM provides observation probabilities to a HMM at the third level. This arrangement allows individual speech-sounds to be represented in the spectral-vector space not as individual Gaussian ellipsoids, as is conventional, but as assemblies of many smaller Gaussian hypercircles tiling the unit hypersphere, offering the potential for more faithful representation of highly complex-shaped speech-sounds, and thus improved classification performance.

According to another aspect of the current invention there is provided a method of processing a signal, the signal comprising a plurality of multi-element data encoding vectors, wherein the data encoding vectors are derived from an analogue or digital input, and where the method employs at least one Gaussian Mixture Model (GMM) or GMM based Hidden Markov Model (HMM), the GMM or GMM based HMM having at least one class mean vector having multiple elements, and the elements of the class mean vector(s) are optimised in an iterative procedure, characterised in that the elements of the class mean vectors are scaled during the optimisation procedure such that the class mean vectors have a constant modulus at each iteration, and the data encoding vectors input to the GMM or GMM based HMM are processed such that they are normalised.

Note that the user(s) of a system trained according to the method of the current invention may be different to the user(s) who performed the training. This is due to the distinction between the training and the classification modes of the invention.

According to another aspect of the current invention there is provided a computer program designed to run on a computer and arranged to implement a signal processing system for processing one or more multi-element input vectors, the system:

-   having means for deriving the data encoding vectors from input signals;
-   being arranged to process the data encoding vectors using at least one of a Gaussian Mixture Model (GMM) and a GMM based Hidden Markov Model (HMM), the GMM or GMM based HMM having at least one class mean vector having multiple elements;
-   being arranged to process the elements of the class mean vector(s) by an iterative optimisation procedure;

characterised in that the system is also arranged to scale the elements of the class mean vector(s) during the optimisation procedure to provide for the class mean vector(s) to have constant modulus at each iteration, and to normalise the data encoding vectors input to the GMM or GMM based HMM.

The present invention can be implemented on a conventional computer system. A computer can be programmed so as to implement a signal processing system according to the current invention to run on the computer hardware.

According to another aspect of the current invention there is provided a speech recogniser incorporating a signal processing system for processing one or more multi-element input vectors, the recogniser:

-   having means for deriving the data encoding vectors from input signals;
-   being arranged to process the data encoding vectors using at least one of a Gaussian Mixture Model (GMM) and a GMM based Hidden Markov Model (HMM), the GMM or GMM based HMM having at least one class mean vector having multiple elements;
-   being arranged to process the elements of the class mean vector(s) by an iterative optimisation procedure;

characterised in that the system is also arranged to scale the elements of the class mean vector(s) during the optimisation procedure to provide for the class mean vector(s) to have constant modulus at each iteration, and to normalise the data encoding vectors input to the GMM or GMM based HMM.

A speech recogniser may advantageously incorporate a signal processing system as described herein, and may incorporate a method of signal processing as described herein.

DESCRIPTION OF THE FIGURES

The current invention will now be described in more detail, by way of example only, with reference to the accompanying Figures, of which:

FIG. 1 diagrammatically illustrates a typical hardware arrangement suitable for use with the current invention when implemented in a speech recogniser;

FIG. 2 shows in block diagrammatic form the conventional re-estimation procedure adopted by the prior art systems employing GMM or HMM based classifiers;

FIG. 3 shows in block diagrammatic form one of the pre-processing stages carried out on input vectors based on frames of speech, relating to the frame's spectral shape;

FIG. 4 shows in block diagrammatic form a further pre-processing stage carried out on the input vectors relating to the overall loudness of a frame of speech;

FIG. 5 shows in block diagrammatic form the modified re-estimation procedure of GMMs, or of ordinary or hierarchical HMMs, as per the current invention;

FIG. 6 shows in more detail the class mean re-scaling constraint shown in FIG. 5;

FIG. 7 shows in block diagrammatic form the implementation of a complete system; and

FIG. 8 shows graphically one advantage of the current invention using the example of a simplified three dimensional input vector space.

DESCRIPTION OF A PREFERRED EMBODIMENT

The current invention would typically be implemented on a computer system having some sort of analogue input, an analogue to digital converter, and digital processing means. The digital processing means would comprise a digital store and a processor. As shown in FIG. 1, a speech recogniser embodiment typically has a microphone 1 acting as a transducer from the speech itself, the electrical output of which is fed to an analogue to digital converter (ADC) 2. There may also be some analogue processing before the ADC (not shown). The ADC feeds its output to a circuit 3 that divides the digital signal into 10 ms slices, and carries out a spectral analysis on each slice, to produce a spectral vector. These spectral vectors are then fed into the signal processor 4, in which is implemented the current invention. The signal processor 4 will have associated with it a digital storage 5. Some applications may have as an input a signal that has been digitised at some remote point, and so would not have the ADC. Other hardware configurations are also possible within the scope of the current invention.

A typical signal processing system of the current invention will comprise a simple GMM and a GMM-based HMM, together used to classify an input signal. Before either of those models can be used for classification purposes, they must first be optimised, or trained, using a set of training data. There are thus two distinct modes of operation of a classification model: the training phase, and the classification phase.

FIG. 2 shows generically the steps used by prior art systems in training both a GMM and a HMM based classifier. FIG. 2 depicts the optimisation of hierarchical GMM-based HMMs as well as the optimisation of ordinary GMM-based HMMs and simple GMMs, because the steps relating to initialising and re-estimating HMM transition probabilities relate to the initialisation and re-estimation of HMM transition probabilities at all levels of the hierarchy. The flow chart is entered from the top when it is required to establish an improved set of parameters in the model to improve the classification performance. First, the various classes need to be initialised: the class means, class covariance matrices and prior class probabilities. HMMs have the additional step of initialising the transition probabilities. These initialisation values may be random, or they may be a “best guess” resulting either from some previous estimation procedure or from some other method.

These initialisations form the adaptive parameters for the first iteration of the training procedure, which proceeds as follows. A data encoding vector, or vector sequence (for the HMM case), from the training sequence is obtained, and processed using a known re-estimation procedure. For GMMs the EM algorithm is often used, and for HMMs the Baum-Welch re-estimation procedure is commonplace. This is the inner loop of the re-estimation procedure, and is carried out for all data encoding vectors in the training sequence.

Following this, the information gained during the inner loop processing is used to compute the new classes and, for the HMM case, the new transition probabilities. Convergence of this new data is tested by comparing it with the previous set, or by judging whether the likelihood function has converged to a stable value, and the process is re-iterated if necessary using the newly computed data as a starting point.

Moving to the current invention, one embodiment of the current invention applied to speech recognition employs a modified spectral vector that is pre-processed in a manner that is different from the conventional log-power representation of the prior art. The spectral vector itself comprises a spectral representation of a 10 ms slice of speech, divided up into typically 25 frequency bins.

The objective of the first stage of the pre-processing is that elements x_(i) (i=1, . . . , m) of the n-dimensional (m≦n) spectral vector x should be proportional to the square roots $\sqrt{P_i}$ of the integrated power P_(i) within different frequency bands, rather than the conventional logarithms of integrated power within different frequency bands. Further, the elements x_(i) (i=1, . . . , m) should be scaled such that their squares sum to a constant A that is independent of the total power integrated across all frequency bands within the frame corresponding to that spectral vector. Thus, if the frame is sampled into m frequency bands, m of the elements x_(i) of the n-dimensional (m≦n) spectral vector x should satisfy

$$x_i = \sqrt{\frac{A\,P_i}{\sum_{j=1}^{m} P_j}} \qquad (i = 1, \ldots, m) \qquad \text{(Equation 1)}$$

which implies

$$\sum_{j=1}^{m} x_j^2 = A.$$

The value of the constant A has no functional significance; all that matters is that it doesn't change from one spectral vector to the next. The advantage of this normalised square-root power representation for spectral vectors is that the degree of match of the shape of a spectral vector x_(i) (i=1, . . . , m), compared with a class mean vector w_(i) (i=1, . . . , n), is then proportional to the scalar product

$$\sum_{i=1}^{m} x_i w_i,$$

irrespective of the modulus (vector length) of the template. This provides the freedom to constrain the modulus of the template without losing the functionality of being able to determine the degree of match of the template by computing the scalar product.

The steps involved in the novel encoding of spectral vectors are represented in the flow diagram of FIG. 3 and listed as follows (a-e). After (a) choosing a value for the constant A to be used for all frames of speech, (b) the first step to be applied for each individual frame of speech is the same as the conventional process for conducting a spectral analysis in order to obtain m values of the integrated power P_(i) (i=1, . . . , m) within m different frequency bands spanning the audible frequency range. Then, instead of taking the logarithms of these power-values as is conventional in the prior art, (c) their sum

$$\sum_{j=1}^{m} P_j$$

and (d) their square roots $\sqrt{P_i}$ (i=1, . . . , m) are computed. (e) Each square-root value $\sqrt{P_i}$ is then divided by the square root of the total power

$$\sum_{j=1}^{m} P_j$$

(and multiplied by a constant scaling factor A as desired) to obtain the elements x_(i) (i=1, . . . , m) of the novel encoding of the spectral vector defined by Equation 1.
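A minimal sketch of steps (a)-(e), assuming the band powers P_(i) have already been computed and adopting the scaling convention under which the squared shape elements sum to A² (consistent with the choice A²+B²=1 and the unit-modulus spectral vectors of the preferred implementation described later):

```python
import numpy as np

def encode_spectral_shape(band_power, A=1.0 / np.sqrt(2)):
    """Square-root power encoding of spectral shape (Equation 1, illustrative sketch).

    band_power: the m integrated powers P_i for one frame of speech.
    Returns the m shape elements x_i and the total power L of the frame.
    """
    band_power = np.asarray(band_power, dtype=float)
    total = band_power.sum()                 # (c) sum of P_j over all frequency bands
    roots = np.sqrt(band_power)              # (d) square roots of the band powers
    x_shape = A * roots / np.sqrt(total)     # (e) divide by sqrt(total power), scale by A
    return x_shape, total
```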

As a second part of the pre-processing of the spectral vectors, the vector is also augmented with the addition of extra elements that represent the overall loudness of the speech at that frame, i.e. the total power

$$\sum_{j=1}^{m} P_j$$

integrated across all frequency bands.

This is particularly useful in conjunction with the novel way of encoding spectral shape defined by Equation 1. This is because the elements x_(i) (i=1, . . . , m) defined by Equation 1 are clearly independent of the overall loudness

$$\sum_{j=1}^{m} P_j$$

and therefore encode no information about it, so those m elements need to be augmented with additional information if the spectral vector is to convey loudness information.

In the current embodiment, two extra elements x_(m+1) and x_(m+2) are added to the spectral vector, beyond the m elements used to encode the spectral shape. Thus the spectral vector will have n=m+2 dimensions. These two elements depend on the overall loudness

$$L \equiv \sum_{j=1}^{m} P_j$$

in the following way:

$$x_{m+1} = B\,\frac{f(L)}{\sqrt{[f(L)]^2 + [g(L)]^2}}, \qquad x_{m+2} = B\,\frac{g(L)}{\sqrt{[f(L)]^2 + [g(L)]^2}} \qquad \text{(Equation 2)}$$

where f( ) and g( ) are two (different) functions of the overall loudness L, and B is a constant. The significance of B is that the ratio B/A determines the relative contributions to the squared modulus

$$|x|^2 = x \cdot x = \sum_{j=1}^{n} x_j^2$$

made by the two subsets of elements (i=m+1, m+2) and (i=1, . . . , m); the values of these contributions are clearly B² and A² respectively. The ratio B/A may therefore be used to control the relative importance assigned to overall loudness and spectral shape in the coding of spectral vectors; for example, choosing B=0 assigns no importance to overall loudness, while choosing similar values of A and B assigns similar importance to both aspects of the speech. The value of A²+B² can be chosen to be 1 for simplicity, which will make the squared modulus

$$|x|^2 = x \cdot x = \sum_{j=1}^{n} x_j^2 = A^2 + B^2$$

equal to 1 for all spectral vectors regardless of their speech content.

The advantages of this novel representation of loudness are (a) that the moduli of all spectral vectors will have the same constant value regardless of overall loudness, which frees one to constrain the moduli of templates (class means) w=(w₁, . . . , w_(n)), as is proposed in the main claims, and (b) that the ratio B/A may be used to control the relative importance assigned to overall loudness and spectral shape in the coding of spectral vectors. Possible choices for the functions f( ) and g( ) include

$$f(L) = \sin\left(\frac{\pi}{2}\,\frac{\log L - \log L^{\min}}{\log L^{\max} - \log L^{\min}}\right), \qquad g(L) = \cos\left(\frac{\pi}{2}\,\frac{\log L - \log L^{\min}}{\log L^{\max} - \log L^{\min}}\right) \qquad \text{(Equation 3)}$$

where L^(min) and L^(max) are constants chosen to correspond to the quietest and loudest volumes (total integrated power) typically encountered in individual frames of speech.

Useful values for the pair of constants (A,B) are (1,0),

$$\left(\frac{1}{\sqrt{2}},\,\frac{1}{\sqrt{2}}\right) \quad \text{and} \quad \left(\sqrt{\frac{2}{n}},\,\sqrt{\frac{m}{n}}\right),$$

which all satisfy A²+B²=1.

Once the functions f( ) and g( ) and the constants B, L^(min) and L^(max), to be used for all frames of speech, have been chosen, the steps involved in the process required to incorporate the loudness encoding as described above are shown in FIG. 4. The process involves (a) summing the integrated powers P_(i) within the m frequency ranges i=1, . . . , m for each frame of speech to obtain the overall loudness L for that frame of speech, (b) evaluating the two extra elements x_(m+1) and x_(m+2) for that frame of speech according to Equation 2, and (c) for that frame of speech, appending the two extra elements to the m elements obtained from the process of FIG. 3 to obtain an n=m+2 dimensional spectral vector incorporating the novel encodings of spectral shape and loudness.
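A sketch of this loudness encoding, using the sinusoidal choices of Equation 3; the default value of B and the way the shape elements are passed in are assumptions made purely for illustration:

```python
import numpy as np

def loudness_elements(L, L_min, L_max, B=1.0 / np.sqrt(2)):
    """Two extra elements x_(m+1), x_(m+2) encoding overall loudness (Equations 2 and 3)."""
    t = (np.log(L) - np.log(L_min)) / (np.log(L_max) - np.log(L_min))
    f, g = np.sin(0.5 * np.pi * t), np.cos(0.5 * np.pi * t)   # Equation 3
    norm = np.sqrt(f ** 2 + g ** 2)                            # equals 1 for this choice of f, g
    return B * f / norm, B * g / norm                          # Equation 2

def append_loudness(x_shape, L, L_min, L_max, B=1.0 / np.sqrt(2)):
    """Append the loudness elements to the m shape elements (the FIG. 4 process, sketch)."""
    x_m1, x_m2 = loudness_elements(L, L_min, L_max, B)
    return np.concatenate([np.asarray(x_shape, dtype=float), [x_m1, x_m2]])
```

With A = B = 1/√2, the resulting n = m + 2 dimensional vector has squared modulus A² + B² = 1, as in the preferred implementation described later.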

The steps as shown in FIGS. 3 and 4 comprise the pre-processing of the spectral vectors according to the embodiment of the current invention.

The input vectors pre-processed as described above are used when optimising the various parameters of the GMMs and GMM-based HMMs. The inner loop of the optimisation procedure, as described in relation to FIG. 2 above, is done using conventional methods such as EM re-estimation and Baum-Welch re-estimation, respectively. Further novel stages are concerned with applying constraints to the parameters in between iterations of this inner loop.

FIG. 5 shows the re-estimation procedure of the current invention, with additional processes present as compared to that shown in FIG. 2. These additional processes relate to the initialisation of the classes before the iterative part of the procedure starts, and to the rescaling of the class means following each iteration to take into account the constraints to be imposed. Note that for the HMM case the transition probability processing is unchanged from the prior art.

One of the constraints applied in between iterations of the inner loop is concerned with the class mean vectors of the GMM or HMM. The constraint takes the form of re-scaling the set of n-dimensional vectors w_(j)=(w_(j1), . . . , w_(jn)) which represent the class means.

This constraint is applied to all the class means, as soon as they have been re-estimated, every time they are re-estimated (by the EM or Baum-Welch re-estimation procedures for example), and also when they are first initialised (see FIG. 5). These extra steps, illustrated in the flow diagram of FIG. 5, are: (a) by summing the squares of its elements and then taking the square root of the sum, the modulus |w_(j)| of each of the N re-estimated class means w_(j) is first computed as

$$|w_j| = \sqrt{\sum_{i=1}^{n} w_{ji}^2} \qquad \text{(Equation 4)}$$

for all N classes j=1, . . . , N; (b) after computing the modulus |w_(j)| of each re-estimated class mean, all the elements of each class mean are divided by that corresponding modulus, i.e.

$$w_{ji} \rightarrow D\,\frac{w_{ji}}{|w_j|}, \quad \text{for all elements } i = 1, \ldots, n \text{ of all GMM classes } j = 1, \ldots, N \qquad \text{(Equation 5)}$$

These steps have the effect of re-scaling all the class means w_(j) to constant modulus D until the next iteration of their re-estimation, after which they are re-scaled again to constant modulus D by applying these steps again, as depicted in FIG. 5. The value of the constant D is preferably set equal to the modulus |x| of the data vectors x. (For example, for a GMM receiving input data having moduli |x| = √(A²+B²), the value of D should be set equal to √(A²+B²).)
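A minimal sketch of the re-scaling constraint of Equations 4 and 5, applied to an array whose rows are the class means; the array layout is an assumption and the sketch is not tied to any particular EM or Baum-Welch implementation:

```python
import numpy as np

def rescale_class_means(W, D=1.0):
    """Re-scale every class mean w_j to constant modulus D (Equations 4 and 5).

    W: (N, n) array whose rows are the class mean vectors.
    Called after initialisation and after every re-estimation of the means.
    """
    moduli = np.sqrt((W ** 2).sum(axis=1, keepdims=True))   # |w_j|, Equation 4
    return D * W / moduli                                    # w_ji -> D * w_ji / |w_j|, Equation 5
```

In the modified procedure of FIG. 5, a call of this kind would immediately follow each re-estimation of the class means, leaving the rest of the conventional update untouched.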

The advantages of re-scaling the class means to constant modulus are that this encourages speech recognition algorithms to adopt novel encodings of speech data that may improve speech classification performance (such as hierarchical sparse coding), and that it may reduce the vulnerability of speech recognition algorithms to becoming trapped in undesirable sub-optimal configurations (‘local minima’) during training. These advantages result from the fact that the dynamics of learning have simplified degrees of freedom because the class means are constrained to remain on a hypersphere (of radius D) as they adapt.

Re-scaling class means w_(j) to constant modulus is particularly appropriate in conjunction with scaling data vectors x to constant modulus. This is because the degree of match between a data vector x and a class mean w_(j) can then be determined purely by computing the scalar product w_(j)·x.

Further to this embodiment of the current invention, the covariance matrices C_(j) of the Gaussian distributions that constitute the GMMs are constrained to be isotropic and of constrained variance V, i.e. they are not optimised according to the conventional re-estimation procedures for covariance matrices (such as the EM algorithm for GMMs and the Baum-Welch procedure for GMM-based HMMs), but are defined once and for all in terms of the isotropic Identity Matrix I and the constrained variance V by

$$C_j \equiv V\,I \quad \text{for all classes } j = 1, \ldots, N \qquad \text{(Equation 6)}$$

V is a free parameter chosen (for example by trial and error) to give the speech recognition system best classification performance; V must be greater than zero, as a covariance matrix has non-negative eigenvalues, and V is preferably significantly smaller than the value of D². The benefit of setting V much smaller than D² is that it leads to a sparse distribution of the first-level simple GMM's posterior probabilities, which in the main embodiment feed the data encoding vector space of the GMM-based HMM at the second level. This is because each Gaussian component of the first-level simple GMM will individually only span a small area on the spectral vector hypersphere.

This process for choosing covariance matrices involves the following steps: (a) choosing a value for the constant of proportionality V so as to optimise the classification performance, for example by trial and error, (b) setting all the diagonal elements of the class covariance matrices equal to V, and (c) setting all the off-diagonal elements of the class covariance matrices equal to zero. Thus, the covariance matrix according to this embodiment of the present invention is both isotropic and diagonal.

Used in conjunction with the above techniques for constraining the moduli of data vectors x and class means w_(j), constraining the class covariances in this way gives the advantage of encouraging speech recognition algorithms to adopt novel encodings of speech data that may improve speech recognition performance (such as hierarchical sparse coding), and reducing the vulnerability of speech recognition algorithms to becoming trapped in undesirable sub-optimal configurations (‘local minima’) during training. Sparse coding results from representing individual speech-sounds as assemblies of many small isotropic Gaussian hypercircles tiling the unit hypersphere in the spectral-vector space, offering the potential for more faithful representation of highly complex-shaped speech-sounds than is permitted by representation as a single anisotropic ellipsoid, and thus improved classification performance.

Because this constraint does away with the need for the conventional unconstrained re-estimation of the covariance matrices, FIG. 5's modified procedure for optimising GMMs does not involve re-estimation of covariance matrices as does the conventional procedure of FIG. 2.

In the case wherein the covariance matrix is constrained to be isotropic, it is well known that each class likelihood of a GMM (from which the GMM's posterior probabilities are derived via the well-known Bayes' theorem) is calculated from the modulus of the vector difference |x−w| between the data-encoding vector x and the appropriate class mean w. It is well known that these quantities can be derived from the scalar product x·w of the data-encoding vector x and the class mean w, from the relation |x−w|² = |x|² + |w|² − 2x·w. In the case of an exponential mixture model, the class likelihoods are computed directly from the scalar product x·w. In cases where a set {w} of N class means are equivalent to one another by translation transformations (such as 2-dimensional translations in an image plane in cases when the data-encoding vectors represent images, or 1-dimensional translations in time in cases when the data-encoding vectors represent 1-dimensional time signals), the well-known “correlation theorem” provides a much more computationally efficient means of calculating the corresponding set {x·w} of N scalar products with a given data-encoding vector x than is provided by performing N scalar product operations explicitly; the equivalent result may instead be obtained by computing the inverse Fourier transform of the component-wise product of the Fourier transform of x with the direction-reverse of the Fourier transform of w. In this way the desired result {x·w} may be obtained in the order of N·log(N) steps instead of N² steps. Further details of this can be found in the prior art of C. J. S. Webber, “Signal Processing Technique”, PCT publication No. WO01/61526. The present invention may be applied to GMMs and/or GMM-based HMMs regardless of whether or not the correlation theorem is used to accelerate the computation of such a set of translation-related scalar products {x·w}.
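The correlation-theorem shortcut mentioned above can be sketched as follows for class means that are cyclic shifts of a single 1-dimensional template; the use of numpy's FFT routines is an illustrative assumption:

```python
import numpy as np

def translation_scalar_products(x, w):
    """Scalar products of x with every cyclic shift of the template w, in O(N log N).

    Component-wise multiplying FFT(x) by the conjugate of FFT(w) and inverting the
    transform yields the whole set {x . w_k} at once (the correlation theorem).
    """
    return np.real(np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(w))))

# Check against N explicit scalar products with every cyclic shift of w.
x, w = np.random.randn(8), np.random.randn(8)
explicit = np.array([np.dot(x, np.roll(w, k)) for k in range(8)])
assert np.allclose(translation_scalar_products(x, w), explicit)
```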

A further constraint imposed on this embodiment of the current invention relates to the choice of prior class probabilities. The N prior probabilities Pr(j) for the GMM classes j=1, . . . , N may be constrained to be constants, i.e. not optimised according to the conventional re-estimation procedures for prior class probabilities (such as the EM algorithm for GMMs and the Baum-Welch procedure for GMM-based HMMs), but are defined once and for all by the step of setting

$$\Pr(j) = 1/N \quad \text{for all classes } j = 1, \ldots, N \qquad \text{(Equation 7)}$$

Used in conjunction with the above innovations for constraining the moduli of data vectors x, class means w_(j) and the covariance matrices C_(j), constraining the prior class probabilities in this way gives the advantage of reducing the vulnerability of speech recognition algorithms to becoming trapped in undesirable sub-optimal configurations (‘local minima’) during training. Because this innovation does away with the need for the conventional unconstrained re-estimation of the prior class probabilities, FIG. 5's modified procedure for optimising GMMs does not involve re-estimation of prior class probabilities as does the conventional procedure of FIG. 2.

It will be understood by people skilled in the relevant arts that the constraints applied to a GMM or HMM as described above in the training phase of the model will equally need to be applied during the classifying phase of use of the models. If they were employed during training, the steps for encoding spectral shape and overall loudness according to the present invention as described above will need to be applied to every spectral vector of any new speech to be classified.

An implementation of the invention, which combines all of the constraints detailed above, is illustrated in FIG. 7. This implementation uses conventional spectral analysis of each frame of speech, followed by the novel steps described above to encode both spectral shape and overall loudness into each spectral vector and to scale every spectral vector's modulus to the constant value of 1. The parameters A and B are both set equal to 1/√2 and D is set equal to 1.

Such unit-modulus spectral vectors are input to a GMM having a hundred Gaussian classes (N=100), with class means all constrained to have moduli equal to 1, with class prior probabilities all constrained to have constant and equal values of 1/100, and covariance matrices constrained to be isotropic and to have constant variances (i.e. not re-estimated at each iteration according to a procedure such as the EM algorithm). A good choice for that constant variance V has been found to be 0.01, although other values could be chosen by trial and error so as to give best speech classification performance of the whole system; the right choice for V will lie between 0 and 1. For each spectral vector input to this GMM, posterior probabilities for the classes are computed in the conventional way.

Each set of GMM posterior probabilities computed above for each spectral vector is used to compute a unit-modulus data-encoding vector for input to an ordinary GMM-based HMM by taking the square roots of those posterior probabilities.
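A sketch of this step for the constrained first-level GMM of the embodiment (equal priors, isotropic variance V, unit-modulus class means); the vectorised implementation details are assumptions made for illustration:

```python
import numpy as np

def gmm_posterior_encoding(x, means, V=0.01):
    """Map one unit-modulus spectral vector to a unit-modulus data-encoding vector.

    means: (N, n) class means with |w_j| = 1. Equal priors cancel in Bayes' theorem,
    and with covariance V*I each class likelihood depends only on |x - w_j|**2.
    """
    sq_dist = ((x - means) ** 2).sum(axis=1)      # |x - w_j|^2 for every class
    log_lik = -0.5 * sq_dist / V                  # log class likelihoods up to a constant
    log_lik -= log_lik.max()                      # numerical stabilisation
    posteriors = np.exp(log_lik)
    posteriors /= posteriors.sum()                # posterior class probabilities
    return np.sqrt(posteriors)                    # square roots: output modulus is exactly 1
```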

These unit-modulus data-encoding vectors are input to the HMM as observation vectors. The class means of the Gaussian mixture that constitutes the parameterisation of the HMM's observation probabilities are all constrained to have moduli equal to 1. The number N of Gaussian classes used to parameterise the HMM's observation probabilities is chosen by trial and error so as to give best speech classification performance of the whole system. The prior probabilities of those classes are then determined by that choice of N; they are all constrained and set equal to 1/N. The covariance matrices of those classes are all constrained to be isotropic and to have constant variances (i.e. not re-estimated unconstrained according to a procedure such as the EM algorithm). The choice of that constant variance V would be determined by trial and error so as to give best speech classification performance of the whole system; the right choice for V will lie between 0 and 1.

The preferred implementation of the invention can be operated in training mode and classification mode. In classification mode, the HMM is used to classify the input observation vectors according to a conventional HMM classification method (Baum-Welch forward-backward algorithm or Viterbi algorithm), subject to the modifications described above.

In training mode, (a) the GMM is optimised on the training set of unit-modulus spectral vectors (encoded as described above) according to a conventional procedure for optimising GMM class means (e.g. the EM re-estimation algorithm), subject to the innovative modifications to re-scale the GMM class means to have constant moduli equal to 1, and to omit the conventional steps for re-estimating the GMM class covariance matrices and prior class probabilities. (b) Once the GMM has been optimised, it is used as described above to compute a set of data-encoding vectors from the training set of speech spectral vectors. (c) This set of data-encoding vectors is then used for training the HMM according to a conventional procedure for optimising HMM class means (e.g. the Baum-Welch re-estimation procedure), subject to the innovative modifications to re-scale the HMM class means to have constant moduli equal to 1, and to omit the conventional steps for re-estimating the HMM class covariance matrices and prior class probabilities. No modification is made to the conventional steps for re-estimating HMM transition probabilities; the conventional Baum-Welch re-estimation procedure may be used for re-estimating HMM transition probabilities.
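Step (a) of this training mode might be organised as below; this is a sketch of constrained EM for the first-level GMM only (fixed covariances V·I, fixed equal priors, class means re-scaled to modulus 1 after every update), with the initialisation and iteration count chosen purely for illustration:

```python
import numpy as np

def train_constrained_gmm(X, n_classes=100, V=0.01, n_iter=20, seed=0):
    """Constrained EM for the first-level GMM (training-mode step (a), sketch).

    X: (T, n) array of unit-modulus spectral vectors.
    Only the class means are re-estimated; after every update they are re-scaled
    to constant modulus D = 1, while covariances and priors stay fixed throughout.
    """
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=n_classes, replace=False)].copy()
    means /= np.linalg.norm(means, axis=1, keepdims=True)      # initial re-scaling, |w_j| = 1
    for _ in range(n_iter):
        # E-step: posterior class probabilities under equal priors and covariance V*I.
        sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        log_lik = -0.5 * sq_dist / V
        log_lik -= log_lik.max(axis=1, keepdims=True)
        resp = np.exp(log_lik)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the class means only.
        means = (resp.T @ X) / (resp.sum(axis=0)[:, None] + 1e-12)
        # Constraint: re-scale every class mean back to constant modulus D = 1.
        means /= np.linalg.norm(means, axis=1, keepdims=True) + 1e-12
    return means
```

Steps (b) and (c) would then encode the training frames through this GMM (taking square roots of the posteriors) and run the analogously constrained Baum-Welch re-estimation for the HMM, with the transition probabilities re-estimated conventionally.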

FIG. 8 illustrates the advantage of employing the constraints of the current invention. This shows a spectral vector x=(x₁, x₂, x₃), where |x|=1. Constraining this spectral vector, e.g. 101, to have a constant modulus has the implication that the class means 102 will all lie on the surface of a hypersphere. In the case shown the hypersphere has two dimensions, and so is an ordinary 2-sphere 103 in an ordinary three-dimensional space. Constraining the covariance matrices to be isotropic and diagonal has the effect that the individual classes will project onto this hypersphere in the form of circles 104. This arrangement allows individual speech-sounds to be represented in the spectral-vector space not as individual Gaussian ellipsoids, as is conventional, but as assemblies 105 of many smaller Gaussian hypercircles 104 tiling the unit hypersphere 103, offering the potential for more faithful representation of highly complex-shaped speech-sounds, and thus improved classification performance. Each class (hypercircle), e.g. 104, will span just a small area within the complex shape that delimits the set of all spectral vectors (which must all lie on the spectral-vector hypersphere 103) that could correspond to alternative pronunciations of a particular individual speech-sound; collectively, many such classes 104 will be able to span that whole complex shape much more faithfully than could a single, anisotropic ellipsoid as is conventionally used to represent an individual speech sound. Other sets of Gaussian classes within the same mixture model will be able to span parts of other complex shapes on the spectral vector hypersphere, i.e. of other speech sounds. The posterior probability associated with each of these Gaussian classes (hypercircles) is a measure of how close the current spectral vector is (on the spectral-vector hypersphere) to the corresponding Gaussian class mean 102 (hypercircle centre). Learning which sets of classes correspond to which speech sounds, on the basis of all the temporal correlations between them that are present in the training speech sequences, is the function of the GMM-based HMM, whose inputs are fed from the set of all those posterior probabilities.

To use an analogy, a large number of hypercircles helps one to avoid local minima far better than would a small number of anisotropic ellipsoids, for effectively the same reason that a bunch of sticks gets tangled more easily than a tray of marbles. (In this analogy, minimising the total gravitational potential of the set of marbles plays the analogous role to maximising the model likelihood.) Similarly, one can map out highly complex shapes much more faithfully by using a lot of marbles than by using a few sticks.

The skilled person will be aware that other embodiments within the scope of the invention may be envisaged, and thus the invention should not be limited to the embodiments as herein described.


1. A signal processing system for processing a plurality ofmulti-element data encoding vectors, the system: having means forderiving the data encoding vectors from input signals; being arranged toprocess the data encoding vectors using a Gaussian Mixture Model (GMM)based Hidden Markov Model (HMM), the GMM based HMM having at least oneclass mean vector having multiple elements; being arranged to processthe elements of the class mean vector(s) by an iterative optimisationprocedure; characterised in that the system is also arranged to scalethe elements of the class mean vector(s) during the optimisationprocedure to provide for the class mean vector(s) to have constantmodulus at each iteration, and to normalise the data encoding vectorsinput to the GMM based HMM.
2. A system as claimed in claim 1 wherein the GMM based HMM has a covariance matrix, the elements of which remain constrained during the optimisation procedure such that the matrix is isotropic and diagonal, and the values of the non-zero diagonal elements remain constant throughout the optimisation procedure.
3. A system as claimed in claim 1 wherein prior class probabilities associated with the GMM based HMM are constrained to be equal, and to remain unchanged throughout the optimisation procedure.
4. A system as claimed in claim 1 wherein the data encoding vectors are normalised such that the vectors have equal moduli.
5. A system as claimed in claim 4 wherein the modulus of each data encoding vector is independent of the overall spectral power in the vector.
6. A system as claimed in claim 4 wherein elements forming spectral coefficients of a data encoding vector are arranged to be individually proportional to the square root of the power in their corresponding spectral band divided by the square root of the overall power contained in spectral bands represented in the vector.
7. A system as claimed in claim 4 wherein the system is arranged to add at least one additional element to each data encoding vector, wherein the added element(s) encode the overall power contained in spectral bands represented in the vector.
8. A system as claimed in claim 7 wherein the system is arranged to add two elements to each data encoding vector to represent the overall power in spectral bands, these two elements arranged such that the sum of their squares is a constant across all data encoding vectors that represent the spectrum of the input signal.
9. A system as claimed in claim 1 wherein the GMM based HMM provides the observation probabilities for a higher level HMM.
10. A system as claimed in claim 1 wherein the derivation of the data encoding vectors from the input signal involves the use of a low level GMM, whereby this low level GMM provides the data encoding vectors to the GMM based HMM that comprise elements derived from the low level GMM's posterior probabilities.
11. A system as claimed in claim 10 wherein elements of the data encoding vectors input from the low level GMM to the GMM based HMM are proportional to the square root of posterior probabilities of the low level GMM.
12. A system as claimed in claim 10 wherein elements of the data encoding vectors input from the low level GMM to the GMM based HMM are proportional to posterior probabilities of the low level GMM.
13. A system as claimed in claim 9 wherein the constant values for the modulus of each of the class mean vectors may be different at each level.
14. A signal processing system for processing a plurality of multi-element data encoding vectors, the system: having means for deriving the data encoding vectors from input signals; being arranged to process the data encoding vectors using a Gaussian Mixture Model (GMM), the GMM having at least one class mean vector having multiple elements; being arranged to process the elements of the class mean vector(s) by an iterative optimisation procedure; characterised in that the system is also arranged to scale the elements of the class mean vector(s) during the optimisation procedure to provide for the class mean vector(s) to have constant modulus at each iteration, and to normalise the data encoding vectors input to the GMM.
15. A system as claimed in claim 14 wherein the GMM has a covariance matrix, the elements of which remain constrained during the optimisation procedure such that the matrix is isotropic and diagonal, and the values of the non-zero diagonal elements remain constant throughout the optimisation procedure.
16. A system as claimed in claim 14 wherein prior class probabilities associated with the GMM are constrained to be equal, and to remain unchanged throughout the optimisation procedure.
17. A system as claimed in claim 14 wherein the data encoding vectors are normalised such that the vectors have equal moduli.
18. A system as claimed in claim 17 wherein the modulus of each data encoding vector is independent of the overall spectral power in the vector.
19. A system as claimed in claim 17 wherein elements forming spectral coefficients of a data encoding vector are arranged to be individually proportional to the square root of the power in their corresponding spectral band divided by the square root of the overall power contained in spectral bands represented in the vector.
20. A system as claimed in claim 17 wherein the system is arranged to add at least one additional element to each data encoding vector, wherein the added element(s) encode the overall power contained in spectral bands represented in the vector.
21. A system as claimed in claim 20 wherein the system is arranged to add two elements to each data encoding vector to represent the overall power in spectral bands, these two elements arranged such that the sum of their squares is a constant across all data encoding vectors that represent the spectrum of the input signal.
22. A system as claimed in claim 14 wherein the derivation of the data encoding vectors from the input signal involves the use of a second, low level GMM, whereby this second GMM provides the data encoding vectors to the original GMM that comprise elements derived from the second GMM's posterior probabilities.
23. A system as claimed in claim 22 wherein elements of the data encoding vectors input from the second GMM to the original GMM are proportional to the square root of posterior probabilities of the second GMM.
24. A system as claimed in claim 22 wherein elements of the data encoding vectors input from the second GMM to the original GMM are proportional to posterior probabilities of the second GMM.
25. A system as claimed in claim 22 wherein the constant values for the modulus of each of the class mean vectors may be different at each level.
26. A method of processing a signal, the signal comprising a plurality of multi-element data encoding vectors, wherein the data encoding vectors are derived from an analogue or digital input, and where the method employs at least one Gaussian Mixture Model (GMM) or GMM based Hidden Markov Model (HMM), the GMM or GMM based HMM having at least one class mean vector having multiple elements, and the elements of the class mean vector(s) are optimised in an iterative procedure, characterised in that the elements of the class mean vectors are scaled during the optimisation procedure such that the class mean vectors have a constant modulus at each iteration, and the data encoding vectors input to the GMM or GMM based HMM are processed such that they are normalised.
27. A method as claimed in claim 26 wherein a covariance matrix within the GMM or GMM based HMM has one or more elements, all of which are constrained during the optimisation procedure such that the matrix is isotropic and diagonal, and the values of its non-zero elements remain constant throughout the optimisation procedure.
28. A method as claimed in claim 26 wherein prior class probabilities associated with the GMM or GMM based HMM are constrained to be equal, and to remain unchanged throughout the optimisation procedure.
29. A method as claimed in claim 26 wherein the data encoding vectors are scaled in a pre-processing stage before being input to the GMM or GMM based HMM, such that the moduli of all data encoding vectors are equal.
30. A method as claimed in claim 29 wherein the modulus of each data encoding vector is independent of the overall power in the vector.
31. A method as claimed in claim 29 wherein elements forming spectral coefficients of a data encoding vector are arranged to be individually proportional to the square root of the power in their corresponding spectral band, divided by the square root of the overall power contained in spectral bands represented in the vector.
32. A method as claimed in claim 29 wherein at least one additional element is added to each data encoding vector, wherein the added element(s) encode the overall power contained in spectral bands represented in the vector.
33. A method as claimed in claim 32 wherein two elements are added to each data encoding vector to represent the overall power in spectral bands, these two elements arranged such that the sum of their squares is a constant across all input vectors that represent the spectrum of the input signal.
34. A method as claimed in claim 26 wherein the GMM or GMM based HMM provides the observation probabilities for a higher level HMM.
35. A method as claimed in claim 26 wherein the derivation of the data encoding vectors from the input signal involves the use of a low level GMM, whereby this low level GMM provides the data encoding vectors to the GMM or GMM based HMM that comprise elements derived from the low level GMM's posterior probabilities.
36. A method as claimed in claim 35 wherein elements of the data encoding vectors input from the low level GMM to the GMM or GMM based HMM are proportional to the square root of posterior probabilities of the low level GMM.
37. A method as claimed in claim 35 wherein elements of the data encoding vectors input from the low level GMM to the GMM or GMM based HMM are proportional to posterior probabilities of the low level GMM.
38. A method as claimed in claim 34 wherein the constant values for the modulus of each of the class mean vectors may be different at each level.
39. A signal processing system that has been trained according to the method as described in claim 26.
40. A computer programmed to implement a signal processing system for processing one or more multi-element input vectors, the system: having means for deriving the data encoding vectors from input signals; being arranged to process the data encoding vectors using at least one of a Gaussian Mixture Model (GMM) and a GMM based Hidden Markov Model (HMM), the GMM or GMM based HMM having at least one class mean vector having multiple elements; being arranged to process the elements of the class mean vector(s) by an iterative optimisation procedure; characterised in that the system is also arranged to scale the elements of the class mean vector(s) during the optimisation procedure to provide for the class mean vector(s) to have constant modulus at each iteration, and to normalise the data encoding vectors input to the GMM or GMM based HMM.
41. A speech recogniser incorporating a signal processing system for processing one or more multi-element input vectors, the recogniser: having means for deriving the data encoding vectors from input signals; being arranged to process the data encoding vectors using at least one of a Gaussian Mixture Model (GMM) and a GMM based Hidden Markov Model (HMM), the GMM or GMM based HMM having at least one class mean vector having multiple elements; being arranged to process the elements of the class mean vector(s) by an iterative optimisation procedure; characterised in that the system is also arranged to scale the elements of the class mean vector(s) during the optimisation procedure to provide for the class mean vector(s) to have constant modulus at each iteration, and to normalise the data encoding vectors input to the GMM or GMM based HMM.
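As a concluding illustration, and not as part of the claimed subject matter, the sketch below constructs a data encoding vector of the kind recited in claims 6 to 8 (and in claims 19 to 21 and 31 to 33): the spectral coefficients are square roots of the band powers normalised by the overall power, and the overall power is carried by two additional elements whose squares sum to a constant. The angular mapping used for those two elements, together with the parameters p_ref and p_max, are assumptions introduced only to satisfy that condition; the specification may define a different encoding.

    import numpy as np

    def encode_spectral_vector(band_powers, p_ref=1e-10, p_max=1.0, radius=1.0):
        # Spectral coefficients: sqrt(band power) normalised by sqrt(total power), so the
        # squares of these elements always sum to 1, independent of overall loudness.
        band_powers = np.asarray(band_powers, dtype=float)
        total = band_powers.sum()
        spectral_part = np.sqrt(band_powers / total)

        # Overall power: carried by two extra elements whose squares sum to radius**2.
        # The log of the total power is squashed onto an angle phi in [0, pi/2];
        # this particular mapping, and p_ref/p_max, are illustrative assumptions.
        phi = (np.log(total) - np.log(p_ref)) / (np.log(p_max) - np.log(p_ref))
        phi = np.clip(phi, 0.0, 1.0) * (np.pi / 2)
        power_part = radius * np.array([np.cos(phi), np.sin(phi)])

        # Every encoded vector therefore has the same modulus, sqrt(1 + radius**2).
        return np.concatenate([spectral_part, power_part])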